In 2017, a report from Centers for Disease Control and Prevention revealed that America’s obesity rate has reached a record high. In contrast to the popular belief, New Yorkers are not so safe from the obesity epidemic, as more than half of adult New Yorkers are either overweight or obese. Studies show that the rise in the obesity epidemic is partly due to disparities in food environment; it is harder for some to eat healthier because their options are limited.
This project intends to look deeper into the relationship between food environment in NYC and obesity rate along with diabetes rate and stroke hospitalization rate.
1. Can the percentage of nation-wide chain and fast-food restaurants in different neighborhood of NYC be related to their obesity, diabetes and stroke hospitalization rate?
2. Can the percentage of fast-food restaurant (defined by cuisine description in the restaurant inspection dataset) in different neighborhood of NYC be related to their obesity, diabetes and stroke hospitalization rate?
Can the percentage of health inspection grade A restaurant in different neighborhood of NYC be related to their obesity, diabetes and stroke hospitalization rate?
Can the composite restaurant health score of different neighborhood of NYC be related to their obesity, diabetes and stroke hospitalization rate?
For the first two questions, we stick to it. For the third one, After looking into the backgroud of the restaurant inspection data, we found out that the percentage of health inspection grade A restaurant is intuitively not related to the three chronic disease health outcomes, so we later consider it as a confounder. We also deleted the fourth problem, since composite restaurant health score is too subjective and difficult to assess. Moreover, we fitted models on the borough level first due to the difficulties we encountered in scaping zipcode-neighborhood data. However, after some struggle, we manage to get the neighborhood information and analyze the data accordingly.
1. NYC restaurant inspection:
https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
This dataset contains the data for restaurant inspection in NYC from August 1, 2014 to June 9, 2019. Every row is a restaurant inspection record that includes the name of the restaurant, zipcode, cuisine type description, inspection grade (A as the best grade) and so on.
Variable used in this datasets
DBA: Name of the restaurant
BORO: name of the boro
ZIPCODE: the zipcode of the restaurant
CUISINE DESCRIPTION: the kind of food that the restaurant is providing
GRADE: Inspection grade for the specific inspection.
2. 2015 Community Health Profiles Open Data:
https://www1.nyc.gov/site/doh/data/data-publications/profiles.page
This dataset contains NYC every neighborhoods’ demographic (percentage of white race, poverty percentage), health (age-adjusted percent of adult exercised in the last 30 days, age-adjusted percent of adults as a smoker) and our main outcome (Age-adjusted percent of adults that is obese (BMI of 30 or gthe reater), Age-adjusted percent of adults, Age-adjusted rate of hospitalizations due to stroke (cerebrovascular disease) per 100,000 adults)
Variable used in this datasets
Name: the name for the neigborhood.
Racewhite_Rate: the percentage for white race
Poverty: Percent of individuals living below the federal poverty threshold
Smoking: age-adjusted percent of adults as a smoker Exercise: Age-adjusted percent of adults that reported getting any exercise in the last 30 days
Obesity: Age-adjusted percent of adults that is obese (BMI of 30 or greater) based on self-reported height and weight
Diabetes: Age-adjusted percent of adults that had ever been told by a healthcare professional that they have diabetes
Stroke_Hosp: Age-adjusted rate of hospitalizations due to stroke (cerebrovascular disease) per 100,000 adults
3. Web scraping for what zipcodes each neighborhood contains
Because Community Health Profiles Open Data has only neighborhood level information, so we have to aggregate neighborhood level information from the restaurant inspeciton data. However, the restaurant inspection data is using zipcodes for every restaurant. Therefore, we have to add the neighborhood name to every restaurant, so that we can group by neighborhood then calculate the percentage for every neighborhood.
First, we scraped the table data from https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm. However the neighborhood name and combinations are different from the health profile data we downloaded. We could not merge the two datasets. Then we tried to look for the raw classification for the neigborhoods in health data from New York University’s Furman Center for Real Estate and Urban Policy and the NYC Department of City Planning. However, still no luck becuase the neighborhood is not divided by zipcode areas. Finally we tried the hardest way: searching the name of the neighborhood on Google and finding the matching zipcodes. Although it is not precise because neighborhoods are not divided according to the area of the zipcode (different neighborhoods sometimes share the same zipcodes), it works out well.
To deal with this issue and get an unbiased result, we randomly assign the zipcode to only one neighborhood if two or more neighborhoods are in the same zipcode area.
First, we will download restaurant inspection data from NYC open data.
get_all_inspections = function(url) {
all_inspections = vector("list", length = 0)
loop_index = 1
chunk_size = 50000
DO_NEXT = TRUE
while (DO_NEXT) {
message("Getting data, page ", loop_index)
all_inspections[[loop_index]] =
GET(url,
query = list(`$order` = "zipcode",
`$limit` = chunk_size,
`$offset` = as.integer((loop_index - 1) * chunk_size)
)
) %>%
content("text") %>%
fromJSON() %>%
as_tibble()
DO_NEXT = dim(all_inspections[[loop_index]])[1] == chunk_size
loop_index = loop_index + 1
}
all_inspections
}
url = "https://data.cityofnewyork.us/resource/9w7m-hzhe.json"
rest_inspection = get_all_inspections(url) %>%
bind_rows()
# changing to date
rest_inspection = rest_inspection %>%
mutate(inspection_date = rest_inspection$inspection_date %>%
strtrim(., nchar(.)-13) %>%
as.Date() )
rest_inspection$inspection_date %>%
max()
## [1] "2019-06-08"
Then, we download health and demographic data CSV into local folder, read in and clean it.
download.file("https://www1.nyc.gov/assets/doh/downloads/excel/episrv/2015_CHP_PUD.xlsx", mode="wb", destfile = "health.xlsx")
health <- read_excel("health.xlsx", sheet = "CHP_all_data") %>%
select(Name, Racewhite_Rate, Poverty, Unemployment,
Smoking, Exercise,
Obesity, Diabetes, Stroke_Hosp, Airquality_rate) %>%
clean_names()
Lastly, we match the neighborhood names with restaurant zipcodes.
zip_neighbor <- read_csv("neigh_zipcode.csv") %>%
mutate(zipcode = as.character(zipcode))
##restaurant data with neighbourhood
rest_neighborhood = left_join(rest_inspection, zip_neighbor, by = "zipcode") %>%
filter(!is.na(neighborhood))
The health data we achieved is very well-structured. We make plots to see the distribution of the three chronic disease outcomes across neighorhoods.
boro = rest_neighborhood %>% select(neighborhood, boro)
health_only_neighborhood <- health[-c(1:6),] %>%
rename(neighborhood = name) %>%
left_join(boro) %>%
mutate(neighborhood = as.factor(neighborhood))
## Joining, by = "neighborhood"
##Plotting for outcome in different neighborhood
bar_obe <- health_only_neighborhood %>%
mutate(neighborhood = fct_reorder(neighborhood, obesity)) %>%
filter(obesity > 25) %>%
ggplot(aes(x = neighborhood,y = obesity, fill = boro)) + geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Neighborhood with 25 percent or more obesity rate")
bar_dia <- health_only_neighborhood %>%
mutate(neighborhood = fct_reorder(neighborhood, diabetes)) %>%
filter(diabetes > 10) %>%
ggplot(aes(x = neighborhood,y = diabetes, fill = boro)) + geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Neighborhood with 10 percent or more diabetes rate")
bar_stro <- health_only_neighborhood %>%
mutate(neighborhood = fct_reorder(neighborhood, stroke_hosp)) %>%
filter(stroke_hosp > 300) %>%
ggplot(aes(x = neighborhood,y = stroke_hosp, fill = boro)) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Neighborhood with 300 or more stroke hospitalization in 100,000 adults")
ggplotly(bar_obe)